unique, and stable, unlike descriptive names, which can be used differently by different
people. For example, NCBI dbSNP assigns an ID with an “rs” prefix to each accepted human
variant whose asserted position is mapped to a reference sequence (a reference variant, or
RefSNP), and an ID with an “ss” prefix to each variant submitted with its flanking
sequence. Figure 4.1 shows two dbSNP IDs for reference variants in the VCF ID column.
Variants may have identifiers from multiple databases. You will see these different types of
identifiers used throughout the literature and in other databases. Different types of identi-
fiers are used for short variants and structural variants.
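The prefix convention described above can be checked programmatically. The following is a minimal sketch, not an official dbSNP tool: the function names `classify_dbsnp_id` and `split_vcf_ids` are hypothetical, and the example identifiers are made-up placeholders. It relies on the VCF convention that multiple identifiers in the ID column are separated by semicolons and that a missing ID is written as a dot.

```python
import re

# Prefix convention from the text: "rs" = reference variant (RefSNP),
# "ss" = submitted variant. Anything else is reported as "other".
_DBSNP_PATTERN = re.compile(r"^(rs|ss)(\d+)$")


def classify_dbsnp_id(identifier: str) -> str:
    """Return 'RefSNP', 'submitted', or 'other' for a variant identifier."""
    match = _DBSNP_PATTERN.match(identifier.strip())
    if match is None:
        return "other"
    return "RefSNP" if match.group(1) == "rs" else "submitted"


def split_vcf_ids(id_field: str) -> list:
    """Split a VCF ID field into individual identifiers.

    Multiple IDs are semicolon-separated; '.' means no ID is available.
    """
    if id_field == ".":
        return []
    return id_field.split(";")


if __name__ == "__main__":
    # Placeholder identifiers, invented for illustration only.
    for variant_id in split_vcf_ids("rs123;ss456;custom_id"):
        print(variant_id, classify_dbsnp_id(variant_id))
```

Identifiers from other databases, including those used for structural variants, would fall into the "other" class here and would need their own patterns.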
4.1.2. Variant Calling and Analysis
Variant calling is the process of identifying variants in sequence data. The
sequence data are usually stored in FASTQ files obtained from whole-genome sequencing,
whole-exome sequencing, or targeted gene sequencing. The reads in the FASTQ files are assessed
for quality and then preprocessed so that only high-quality reads remain. The reads
are then aligned to a reference genome, and the read alignment information is stored in
BAM files. The BAM files are then used as input to variant calling programs for variant
identification and analysis. The identified variants are written to a VCF file. A single
VCF file can hold thousands of variants and the genotypes of multiple samples. Genetic
studies usually focus on germline variant calling, where the reference genome used for
mapping the reads is the standard reference for the species of interest; this allows us to
identify genotypes. Somatic variant calling is used to study diseases such as cancer. In somatic
variant calling, the reference is a related tissue from the same individual (e.g., healthy tissue
in the case of cancer). Here, we expect to see genetic mosaicism between cells, or the presence of
more than one genetic line, as a result of mutations.
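To make the end product of this workflow concrete, the sketch below reads a small multi-sample VCF and extracts each sample's genotype. It is a simplified illustration that assumes a GT key is present in the FORMAT column; the VCF content (positions, ID, sample names, genotypes) is invented, and the helper names `parse_vcf` and `carries_variant` are hypothetical. For real data, a dedicated VCF library would be a safer choice.

```python
import io

# Hypothetical two-sample VCF; every value below is invented for illustration.
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
chr1\t1000\trs0000001\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0
chr1\t2000\t.\tC\tT\t60\tPASS\t.\tGT\t1/1\t0/1
"""


def parse_vcf(handle):
    """Yield one dict per variant record, with per-sample genotype strings."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        # Locate the GT key within the colon-separated FORMAT column.
        gt_index = fields[8].split(":").index("GT")
        genotypes = {
            name: value.split(":")[gt_index]
            for name, value in zip(samples, fields[9:])
        }
        yield {
            "chrom": fields[0],
            "pos": int(fields[1]),
            "id": fields[2],
            "ref": fields[3],
            "alt": fields[4].split(","),
            "genotypes": genotypes,
        }


def carries_variant(genotype):
    """True if any called allele in a GT string (e.g. '0/1', '1|1') is non-reference."""
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


if __name__ == "__main__":
    for record in parse_vcf(io.StringIO(EXAMPLE_VCF)):
        carriers = [s for s, gt in record["genotypes"].items() if carries_variant(gt)]
        print(record["chrom"], record["pos"], record["id"], carriers)
```

In the genotype strings, 0 denotes the reference allele and 1 the first alternate allele, so `0/1` is a heterozygous call and `1/1` a homozygous alternate call.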
A variant calling workflow begins with raw sequencing data for multiple samples or
individuals and ends with a single VCF file containing only the genomic positions where at
least one individual in the population has a variant due to mutations. After variants have
been called, they can be analyzed in different ways. For example, we may wish to determine
which genes are affected by the variants, what consequences the variants have on those genes,
and which phenotypes are associated with them. Thus, called variants can be annotated
with their consequences and associated with phenotypes, and the
results can be interpreted to answer research questions.
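The simplest form of annotation, finding which genes a variant falls in, reduces to an interval-overlap lookup. The sketch below illustrates the idea under stated assumptions: the gene names and coordinates are invented, the `Gene` type and the `genes_at` function are hypothetical, and coordinates are taken to be 1-based and inclusive. Real annotation tools also predict the consequence of each variant on transcripts and proteins, which this sketch does not attempt.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Hypothetical gene models, invented for illustration only.
GENES = [
    Gene("GENE_A", "chr1", 900, 1500),
    Gene("GENE_B", "chr1", 1400, 2500),
    Gene("GENE_C", "chr2", 100, 800),
]


def genes_at(chrom, pos, genes=GENES):
    """Return the names of all genes whose interval contains the position."""
    return [g.name for g in genes if g.chrom == chrom and g.start <= pos <= g.end]


if __name__ == "__main__":
    for chrom, pos in [("chr1", 1000), ("chr1", 1450), ("chr1", 3000)]:
        print(chrom, pos, genes_at(chrom, pos) or "intergenic")
```

A linear scan is adequate for a handful of genes; for genome-scale annotation, an interval tree or a sorted index would be the usual design choice.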
Before digging into the steps of variant calling and analysis, it helps to distinguish
between the types of genetic variation studies. There are several types, but
generally they can be classified into (i) genome-wide association
studies (GWASs) [3], (ii) studies on the consequences of variants [4], and (iii) population
genetics [5].
GWASs involve genotyping a sample of individuals at common variants across the
genome using a genome-wide survey for variants. Studies of this kind are carried out on
groups of individuals to identify variants and their associated phenotypes: a variant that
causes the phenotype will be found at a higher frequency in the affected individuals than
in the controls. The phenotype–genotype
associations must be supported by statistical evidence based on the population studied.
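One common way to obtain such statistical evidence for a single variant is an allelic chi-square test on a 2x2 table of allele counts in affected individuals (cases) and controls. The sketch below is an illustration, not the method of any specific GWAS tool: the function name `allelic_association` is hypothetical, the counts in the usage example are invented, and all four cells are assumed to be non-zero. It uses the Pearson chi-square statistic with one degree of freedom and no continuity correction.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio). Assumes every count is non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, p_value, odds_ratio


if __name__ == "__main__":
    # Invented counts: the alternate allele is more frequent in cases.
    chi2, p_value, odds_ratio = allelic_association(60, 40, 40, 60)
    print(f"chi2={chi2:.2f} p={p_value:.4f} OR={odds_ratio:.2f}")
```

Because a GWAS tests many variants at once, the significance threshold applied to each individual p-value is made much stricter than the conventional 0.05 to account for multiple testing.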